You will hopefully already be familiar with the concept of sampling, and why we do it.
In this course we will specifically look at examples used for environmental data.
Some of this will be revision, but there may also be some new methods.
We use samples in environmental & ecological data in situations where it is not possible to measure the entire population.
In environmental settings, this could be because:
The population is too large.
Some or all of the population is difficult, expensive or even impossible to reach.
The samples may be destructive, i.e. taking the sample causes permanent damage to the object being measured.
We want to use the information we obtain on the sample in order to make inference on the population.
Population
The population is the set of all possible objects that could be sampled.
Sampling Units
The sampling units are the members of the population, i.e. the objects that could be sampled.
Sample
The sample is a collection of sampling units, i.e. a subset of the population.
When we design an environmental or ecological study, we should focus on these steps:
We need to define clear and simple objectives for our study.
Example:
What is the spatial or temporal variability of water quality across a River Surveillance Network (RSN)?
Example: Water quality
Target population: RSN 1:250k with over 1.4 million reaches (a reach is a discrete segment of a river with relatively uniform characteristics).
Characterise environmental conditions of the target population such as Water Quality Indicators, i.e., we need to define our response variable:
Macroinvertebrate composition obtained from the RICT Model 44 network (1:50k scale and trimmed to match the RSN network). E.g., WHPT-ASPT (Walley Hawkes Paisley Trigg Average Score Per Taxon) is a biological metric used to evaluate the ecological health of rivers based on the presence and sensitivity of macroinvertebrate (e.g., insects, worms, snails) communities.
Orthophosphate \([\text{PO}_4]^{3-}\) concentrations (mg/L)
There are a number of sampling designs which are commonly used for environmental data:
We will discuss some of these in more detail during the course.
Data collection - what information is being collected and how?
Biological elements - Macrophytes, macroinvertebrates, diatoms.
River habitat survey - Physical habitat essential for fish, macrophytes and invertebrates to live.
Physico-chemical elements - Water quality elements like dissolved oxygen, orthophosphate, nitrogen.
Toxic chemicals - Harmful (potentially banned) chemicals
Invasive non-native species - Plants, animals, fungi, or other organisms that have been introduced to a new area where they are not native.
Physical properties - Temperature, width, slope, altitude, etc.
Implementation - Deploying the network and measuring the quantities of interest. Some practical challenges include:
River is dry
Route issues
Site is overgrown, fenced or with barbed wire
Steep or high banks
Land owner permission
Safety issues.
Often statisticians will not actually carry out the sampling. We will rely on field experts in many cases.
Once we receive the data, it’s important to assess the data for censoring, outliers, missingness etc.
We can then fit an appropriate statistical model.
Finally, we should report our results in clear language, including uncertainty where appropriate.
We are interested in population parameter(s) \(\theta\) .
Typically, the value of \(\theta\) is unknown and it is unfeasible to measure all \(N\) elements of the population.
We select a representative sample and measure \(n < N\) units to estimate it.
The question now is: how do we select such units?
A sampling strategy integrates both sample selection methods from a target population and estimation techniques to infer population attributes from sample measurements.
Def. Consistency
An estimator \(\hat{\theta}\) of \(\theta\) is said to be consistent if for any \(\epsilon >0\)
\[ \lim_{n\to\infty} \mathbb{P}(|\hat{\theta}-\theta| > \epsilon) = 0 \]
Expected value
The expected value of an estimator is a weighted average of all possible estimates.
\[ \mathbb{E}(\hat{\theta})= \sum_{s\in\Omega} p(s)\hat{\theta}(s) \]
Here \(\Omega\) is the set of all possible samples \(s\). The expected value is a function of both the sampling design (through the sample selection probabilities \(p(s)\)) and the population being sampled (through the sample estimate \(\hat{\theta}(s)\)).
Bias
The bias of an estimator is the difference between its expected value and the population parameter.
\[ \text{Bias}(\hat{\theta}) = \mathbb{E}(\hat{\theta}) - \theta \]
Variance
Average squared distance between individual estimates \(\hat{\theta}(s)\) and their expected value \(\mathbb{E}(\hat{\theta})\)
\[ \text{Var}(\hat{\theta}) = \sum_{s\in \Omega} p(s) \left[\hat{\theta}(s)-\mathbb{E}(\hat{\theta})\right]^2 \]
Precision
The precision of an estimator is a qualitative measure of how small or large its variability is.
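The definitions of expected value, bias and variance can be checked by brute force on a tiny example. The sketch below assumes a made-up population of five values, a sample size of two, and simple random sampling without replacement (so every possible sample has the same probability); the sample mean is used as the estimator. None of these choices come from the notes.

```python
from itertools import combinations
from fractions import Fraction

# Hypothetical population of N = 5 values (illustrative only).
population = [2, 4, 6, 8, 15]
n = 2

# Assumed design: simple random sampling without replacement, so every
# sample s of size n has the same probability p(s).
samples = list(combinations(population, n))
p = Fraction(1, len(samples))

def estimator(sample):
    """Sample mean, used here as the estimator of the population mean."""
    return Fraction(sum(sample), len(sample))

theta = Fraction(sum(population), len(population))   # true population mean
expected = sum(p * estimator(s) for s in samples)    # E(theta_hat)
bias = expected - theta                              # Bias(theta_hat)
variance = sum(p * (estimator(s) - expected) ** 2 for s in samples)  # Var(theta_hat)

print("theta:", theta, "expected:", expected, "bias:", bias, "variance:", variance)
```

Exact fractions are used so that the zero bias of the sample mean under this design shows up exactly rather than as a rounding-level number.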
As the name suggests, simple random sampling is the simplest form of sampling.
Every object in our population has an equal probability of being included in the sample.
This requires us to have a complete list of the population members, or a sampling frame covering the entire region.
We then generate a set of \(n\) random numbers which identify the individuals or objects to be included in the study.
\[\bar{y} = \frac{\sum_{i=1}^n y_i}{n}.\]
\[s^2 =\frac{\sum_{i=1}^n (y_i - \bar{y})^2}{n-1}.\]
As well as estimating the population mean and variance, we also have to think about the uncertainty surrounding these estimates.
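As a minimal sketch of simple random sampling, the code below builds a hypothetical sampling frame of 200 labelled units with simulated measurements, draws 20 distinct labels at random, and computes the sample mean and the sample variance with the \(n-1\) divisor. The frame, the measurements and the seed are all invented for illustration.

```python
import random
import statistics

# Hypothetical sampling frame: unit labels 1..N with a simulated measurement each.
random.seed(1)  # fixed seed so the sketch is reproducible
N = 200
frame = {unit: random.gauss(10.0, 2.0) for unit in range(1, N + 1)}

# Draw n distinct unit labels; every unit is equally likely to be included.
n = 20
selected = random.sample(sorted(frame), n)
y = [frame[unit] for unit in selected]

y_bar = sum(y) / n                                  # sample mean
s2 = sum((yi - y_bar) ** 2 for yi in y) / (n - 1)   # sample variance, divisor n - 1

print(f"sample mean = {y_bar:.3f}, sample variance = {s2:.3f}")
```

The hand-written formulas agree with `statistics.mean` and `statistics.variance`, which use the same definitions.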
We then ensure that each of these strata is represented proportionally within our sample (known as proportional allocation).
Let \(N_1, \ldots, N_L\) be the populations of our \(L\) strata, and \(n_1, \ldots, n_L\) be the number of samples taken from each.
It is straightforward to obtain sample means \(y_1, \ldots, y_L\) and sample variances \(s_1^2, \ldots, s_L^2\) for each stratum.
Then we compute the overall sample mean as \[\bar{y} = \frac{\sum_{l=1}^L \left( N_l \ y_l \right)}{N}.\]
We can also compute the variance of the sample mean as
\[\text{Var}(\bar{y}) = \sum_{l=1}^L \left[ \left(\frac{N_l}{N}\right)^2 \frac{s_l^2}{n_l} \left(1 - \frac{n_l}{N_l} \right) \right].\]
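The stratified estimators above can be sketched in a few lines. The three strata, their population sizes and the sampled values below are hypothetical, chosen so that the allocation is proportional (5, 3 and 2 samples from strata of size 500, 300 and 200).

```python
# Hypothetical strata: population size N_l and the values sampled from each stratum.
strata = {
    "upland":  {"N": 500, "sample": [4.1, 3.8, 4.4, 4.0, 4.2]},
    "lowland": {"N": 300, "sample": [6.2, 5.9, 6.5]},
    "urban":   {"N": 200, "sample": [9.0, 8.4]},
}

N = sum(st["N"] for st in strata.values())

def mean(values):
    return sum(values) / len(values)

def sample_variance(values):
    m = mean(values)
    return sum((v - m) ** 2 for v in values) / (len(values) - 1)

y_bar = 0.0       # overall (stratified) sample mean
var_y_bar = 0.0   # estimated variance of the overall sample mean
for st in strata.values():
    N_l, y = st["N"], st["sample"]
    n_l = len(y)
    w_l = N_l / N
    y_bar += w_l * mean(y)
    # Stratum contribution, including the finite population correction (1 - n_l / N_l).
    var_y_bar += w_l ** 2 * sample_variance(y) / n_l * (1 - n_l / N_l)

print(f"stratified mean = {y_bar:.3f}, estimated variance = {var_y_bar:.5f}")
```

Each stratum needs at least two sampled units here, otherwise its sample variance is undefined.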
Systematic sampling is a sampling method which makes use of a natural ordering that exists in data.
We wish to take a sample of size \(n\) from a population of size \(N\), which means we sample every \(k\)th object, where \(k = \frac{N}{n}\).
For systematic sampling, we select our first unit at random, then select every \(k\)th unit in a systematic way.
For example, if we have \(N=50\) and \(n=5\), then \(k=10\).
If our first unit is 2, our sample becomes units 2, 12, 22, 32, 42.
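The selection rule in this example can be written as a short function. This is a sketch that assumes \(N\) is an exact multiple of \(n\) and that the random start is drawn from the first \(k\) units; the function name `systematic_sample` is made up for illustration.

```python
import random

def systematic_sample(N, n, start=None):
    """Return the unit labels (1..N) of a systematic sample of size n.

    Assumes N is a multiple of n, so that k = N / n is a whole number.
    """
    if N % n != 0:
        raise ValueError("this sketch assumes N is a multiple of n")
    k = N // n
    if start is None:
        start = random.randint(1, k)  # random first unit among the first k units
    return [start + j * k for j in range(n)]

print(systematic_sample(50, 5, start=2))  # [2, 12, 22, 32, 42]
```

Once the start is fixed the rest of the sample is fully determined, which is why only \(k\) distinct samples are possible under this scheme.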
| Advantages 😁 | Disadvantages 😔 |
|---|---|
| Convenient and quick. | May not be representative. |
| Well spaced across the study. | Systematic patterns in the data can be overlooked. |
| Sort of random — every object has an equal chance of selection. | Extremely deterministic — estimation of variance particularly difficult. |
Spatial sampling often uses a systematic sampling scheme based on transects.
A transect is a straight line along which samples are taken.
The starting point, geographical orientation and number of samples are chosen as part of the sampling scheme.
Samples are then taken either at random points along the length of the line (continuous sampling) or at systematically placed points (systematic sampling).
Suppose we need to take samples of water quality on a lake.
Our sampling scheme may use multiple transects simultaneously.
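To make the two ways of placing points along a transect concrete, here is a small sketch. The coordinate system, the bearing convention (degrees clockwise from north) and the function name `transect_points` are assumptions made for this example only.

```python
import math
import random

def transect_points(start, bearing_deg, length, n, systematic=True):
    """Return n (x, y) sampling points along a straight transect.

    bearing_deg is measured clockwise from north (the positive y direction).
    """
    theta = math.radians(bearing_deg)
    dx, dy = math.sin(theta), math.cos(theta)  # unit vector along the transect
    if systematic:
        spacing = length / n
        offset = random.uniform(0, spacing)    # random start, then equal spacing
        distances = [offset + j * spacing for j in range(n)]
    else:
        # Points placed independently and uniformly at random along the line.
        distances = sorted(random.uniform(0, length) for _ in range(n))
    x0, y0 = start
    return [(x0 + d * dx, y0 + d * dy) for d in distances]

# Five systematically placed points on a 100 m transect heading due east.
points = transect_points(start=(0.0, 0.0), bearing_deg=90, length=100.0, n=5)
```

Several transects can be generated by calling the function with different starting points or bearings.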
Distance sampling is a popular method in ecology for estimating animal abundance.
Data are obtained by measuring the perpendicular distances from a transect line to detected individuals.
The probability of detection is modelled as a parametric function that decreases with increasing distance from the line.
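The notes do not say which parametric function is used. One common choice is the half-normal form \(g(x) = \exp\{-x^2/(2\sigma^2)\}\), and the sketch below assumes it purely for illustration; the scale parameter `sigma` and the distances are made-up values.

```python
import math

def half_normal_detection(distance, sigma):
    """Detection probability at a perpendicular distance, assuming a half-normal form."""
    return math.exp(-distance ** 2 / (2 * sigma ** 2))

# Detection probability at a few perpendicular distances (in metres), with sigma = 20.
for x in [0, 10, 20, 40]:
    print(x, round(half_normal_detection(x, sigma=20.0), 3))
```

Under this form, detection is certain on the line itself and falls off smoothly with distance, with `sigma` controlling how quickly.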
The quadrats shown below were used to study orangutan nests in a region of Borneo.
The aligned and centrally aligned grids are convenient but may miss systematic patterns in the data.
The unaligned grid avoids this, and combines the advantages of simple random sampling and stratified sampling. However, it can be inefficient for collection.
The triangular grid can perform well in specific cases where the spatial correlation structure varies with direction.
Aim: Estimate the average level of chlorophyll-a in Lake Balaton, Hungary.
Levels are heavily affected by differences in the levels of nutrients along the length of the lake (known as a “trophic gradient”).
\[\bar{x} \pm t_{1-\alpha/2} \sqrt{\text{Var}(\bar{x})}.\]
The width of the interval is determined by the estimated standard error, \(\sqrt{\text{Var}(\bar{x})}\), and we know the formula for this contains \(n\).
Therefore, if we know how wide we need our interval to be (i.e. we know the required precision), we can calculate the \(n\) required to do that.
Let our maximum required standard error be denoted as \(U\). Then we need to compute: \[ \begin{aligned} \sqrt{\text{Var}(\bar{x})} &\leq U \\ \frac{\sqrt{s^2}}{\sqrt{n}} &\leq U \\ \sqrt{n} &\geq \frac{\sqrt{s^2}}{U} \end{aligned} \]
Here, \(s^2\) is the sample variance.
Can anyone see a problem here?
\[n \geq \left(\frac{s}{U} \right)^2 = \left(\frac{3.19}{0.1} \right)^2 \approx 1017.6, \quad \text{so we need } n = 1018.\]
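The calculation above is easy to wrap in a helper. This sketch rounds up to the next whole unit and, like the formula it implements, ignores any finite population correction; the function name `required_sample_size` is hypothetical.

```python
import math

def required_sample_size(s, U):
    """Smallest whole n with s / sqrt(n) <= U (no finite population correction)."""
    return math.ceil((s / U) ** 2)

n = required_sample_size(s=3.19, U=0.1)
print(n)  # 1018
```

Because \(n\) grows with the square of \(s/U\), halving the required standard error roughly quadruples the sample size.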
For stratified sampling, this process is much more complicated.
Now let \(\omega_l = N_l/N\) be the proportion of the overall population which is found within stratum \(l\).
Also let \(\sigma_l\) be the standard deviation for the population of stratum \(l\).
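We can then compute the optimum number of samples in each stratum as
\[n_l = n \ \frac{\omega_l \sigma_l /\sqrt{c_l}}{\sum_{k=1}^L \omega_k \sigma_k /\sqrt{c_k}},\]
where \(c_l\) is the cost of taking one sample in stratum \(l\). If the cost is the same in every stratum, this simplifies to
\[n_l = n \ \frac{\omega_l \sigma_l}{\sum_{k=1}^L \omega_k \sigma_k}.\]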
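Both allocation formulas can be sketched with one function, assuming \(c_l\) denotes the per-sample cost in stratum \(l\) and that leaving the costs out means they are equal. The stratum weights, standard deviations and costs below are invented for illustration, and `allocate` is a hypothetical name.

```python
import math

def allocate(n, weights, sigmas, costs=None):
    """Split n samples across strata in proportion to w_l * sigma_l / sqrt(c_l)."""
    if costs is None:
        costs = [1.0] * len(weights)  # equal costs: the second formula
    scores = [w * s / math.sqrt(c) for w, s, c in zip(weights, sigmas, costs)]
    total = sum(scores)
    return [n * score / total for score in scores]

weights = [0.5, 0.3, 0.2]  # omega_l = N_l / N
sigmas = [2.0, 4.0, 6.0]   # sigma_l, assumed known or estimated from a pilot study

equal_cost = allocate(34, weights, sigmas)
with_costs = allocate(22, weights, sigmas, costs=[1.0, 4.0, 4.0])
print(equal_cost, with_costs)
```

The function returns real numbers, so in practice the allocations still need rounding to whole samples. Small but highly variable strata receive more samples than proportional allocation would give them, and expensive strata receive fewer.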
A monitoring network is a set of stations placed across a region of interest to gather information about one or more environmental resources.
The Countryside Survey is a census of the natural resources of the UK’s countryside.
The first full survey was in 1978, and it was taken again at 6–10 year intervals until 2019.
Since 2019, it has been funded as a “rolling” survey, measuring locations on 5-yearly cycles.
The goal is to map changes at various different scales, as well as to understand what is driving those changes.
Generalised Random Tessellation Stratified (GRTS)
GRTS is a form of spatially balanced probability sampling.
Example: Lakes monitoring
There are \(N= 16\) main lakes and a sample of \(n=4\) is desired.
Assuming equal sampling probabilities, the inclusion probability for each lake is \(n/N = 4/16 = 0.25\).
The coloured cells indicate sites where small (blue) and large (red) lakes are present.
Transform the level \(k\) grid cell to a one-dimensional number line by sorting the cells hierarchically (starting from the first-level label).
Use systematic sampling along the line to select the resources to survey. E.g., draw \(u_1 \sim \text{U}(0,1)\) and select as the first site \(s_1\) the one whose segment of the line contains \(u_1\). The remaining sites \(j = 2,\ldots,n\) are those whose segments contain \(u_j = u_{j-1} +1\).
E.g., Suppose we would like larger lakes to be twice as likely to be selected as small lakes.
Instead of giving all lakes the same unit length, we can give large lakes twice the unit length of small lakes.
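The systematic step along the line can be sketched as follows. This covers only the final selection and assumes the sites are already in their hierarchical order; it is not the implementation used by any package. Segment lengths are rescaled so that they sum to \(n\), which makes each length the site's inclusion probability. The names `select_along_line`, `sites` and `weights` are hypothetical.

```python
import random
from bisect import bisect_right
from itertools import accumulate

def select_along_line(sites, weights, n, u1=None):
    """Systematic selection along a line whose segments are proportional to weights.

    sites must already be in hierarchical order; that ordering step is not shown.
    """
    total = sum(weights)
    lengths = [n * w / total for w in weights]  # inclusion probabilities, summing to n
    if max(lengths) > 1:
        raise ValueError("a segment longer than 1 could be selected more than once")
    ends = list(accumulate(lengths))            # right-hand end of each site's segment
    if u1 is None:
        u1 = random.random()                    # u_1 ~ U(0, 1)
    points = [u1 + j for j in range(n)]         # u_j = u_{j-1} + 1
    chosen = []
    for u in points:
        idx = min(bisect_right(ends, u), len(sites) - 1)  # guard against rounding at the far end
        chosen.append(sites[idx])
    return chosen

lakes = [f"L{i}" for i in range(1, 17)]

# Equal probabilities: every lake occupies a segment of length 4/16 = 0.25.
equal = select_along_line(lakes, [1] * 16, n=4, u1=0.6)

# Unequal probabilities: suppose the first four lakes are large and get double weight.
unequal = select_along_line(lakes, [2] * 4 + [1] * 12, n=4, u1=0.5)
print(equal, unequal)
```

In the unequal case the large lakes have inclusion probability 0.4 and the small ones 0.2, so large lakes are twice as likely to be selected.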
In addition to unequal inclusion probabilities we can also perform stratified sampling.
Instead of sampling from the entire sampling frame simultaneously, we divide the sampling frame into distinct sets of sites (strata) and select samples from each stratum independently.
The GRTS algorithm is applied to each stratum to obtain stratum-specific samples.
The R package spsurvey implements the GRTS algorithm to select spatially balanced samples via the grts() function.
It is generally very difficult to untangle the effects of a single event.
Even if we identify a change in the mean or variance, how do we know that it is due to our event?
Many environmental systems change naturally over time for any number of reasons.
We don’t have a statistical control. (We can’t turn back the clock and check what would have happened without the event.)
Statistical model
\[X_{ik} = \mu + \alpha_i + \tau_{k(i)} + \varepsilon_{ik}\]
Assessing impact of intervention with data for multiple sites.
Select \(j = 1,\ldots,M\) sites in the impact area and sample before/after intervention.
Statistical model
\[X_{ijk} = \mu + \alpha_i + \tau_{jk(i)} + \delta_j + \varepsilon_{ijk}\]
Statistical model
\[X_{ij} = \mu + \alpha_i + \beta_j + (\alpha\beta)_{ij} + \varepsilon_{ij}\]
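The notes do not define the indices in this last model. As an illustration only, the sketch below assumes \(i\) indexes the period (before or after the event) and \(j\) indexes the type of site (control or impact), so that the interaction term \((\alpha\beta)_{ij}\) captures a change at the impact site beyond the change seen at the control site. The observations are made up.

```python
# Hypothetical observations, keyed by (period, site type).
data = {
    ("before", "control"): [4.8, 5.2],
    ("after", "control"): [5.4, 5.6],
    ("before", "impact"): [5.1, 5.3],
    ("after", "impact"): [7.6, 7.8],
}

cell_mean = {key: sum(values) / len(values) for key, values in data.items()}

# Change over time at each type of site.
change_impact = cell_mean[("after", "impact")] - cell_mean[("before", "impact")]
change_control = cell_mean[("after", "control")] - cell_mean[("before", "control")]

# Interaction contrast: the extra change at the impact site relative to the control site.
interaction_contrast = change_impact - change_control
print(round(interaction_contrast, 3))
```

A contrast near zero would suggest both sites changed by a similar amount, which is the kind of natural change over time that a single before/after comparison cannot separate from the event. A formal analysis would fit the model and test the interaction term rather than compare cell means.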
We use information we obtain from a sample to make inference on the population.
There are five steps to designing a sampling experiment:
Types of sampling include:
Simple Random Sampling
Stratified Sampling
Systematic Sampling
Spatial Sampling (including Transects, Distance Sampling, Quadrats & Grid Sampling)
Sample size depends on the intended power and precision.
E.g., if we want to estimate a mean value \(\bar{x}\), given a maximum required standard error \(U\) and sample variance \(s^2\), then sample size \(n\) must be such that:
\[
\sqrt{n} \geq \frac{\sqrt{s^2}}{U}.
\]
For stratified sampling, the optimum number of samples in each stratum \(l\) is calculated as:
\[
n_l = n \ \frac{\omega_l \sigma_l /\sqrt{c_l}}{\sum_{k=1}^L \omega_k \sigma_k /\sqrt{c_k}}
\]
A monitoring network is a set of stations placed in a region of interest to gather information about one or more environmental variables.
Sampling methods such as GRTS can be used to identify the sites/resources to be monitored.